Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes
As AI systems become an increasing part of people's everyday lives, it
becomes ever more important that they understand people's ethical norms.
Motivated by descriptive ethics, a field of study that focuses on people's
descriptive judgments rather than theoretical prescriptions on morality, we
investigate a novel, data-driven approach to machine ethics.
We introduce Scruples, the first large-scale dataset with 625,000 ethical
judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex
ethical situation, often posing moral dilemmas, paired with a distribution of
judgments contributed by the community members. Our dataset presents a major
challenge to state-of-the-art neural language models, leaving significant room
for improvement. However, when presented with simplified moral situations, the
results are considerably more promising, suggesting that neural models can
effectively learn simpler ethical building blocks.
A key take-away of our empirical analysis is that norms are not always
clean-cut; many situations are naturally divisive. We present a new method to
estimate the best possible performance on such tasks with inherently diverse
label distributions, and explore likelihood functions that separate intrinsic
from model uncertainty.
Comment: 18 pages, 14 tables, 18 figures. Accepted to AAAI 2021. For
associated code and data, see https://github.com/allenai/scruple
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Commonsense AI has long been seen as a near impossible goal -- until
recently. Now, research interest has sharply increased with an influx of new
benchmarks and models.
We propose two new ways to evaluate commonsense models, emphasizing their
generality on new tasks and building on diverse, recently introduced
benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote
research on commonsense models that generalize well over multiple tasks and
datasets. Second, we propose a novel evaluation, the cost equivalent curve,
that sheds new light on how the choice of source datasets, pretrained
language models, and transfer learning methods impacts performance and data
efficiency.
We perform extensive experiments -- over 200 experiments encompassing 4800
models -- and report multiple valuable and sometimes surprising findings, e.g.,
that transfer almost always leads to better or equivalent performance if
following a particular recipe, that QA-based commonsense datasets transfer well
with each other, while commonsense knowledge graphs do not, and that perhaps
counter-intuitively, larger models benefit more from transfer than smaller
ones.
Last but not least, we introduce a new universal commonsense reasoning model,
UNICORN, that establishes new state-of-the-art performance across 8 popular
commonsense benchmarks: aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA
(90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA
(79.3%).
Comment: 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For
associated code and data see https://github.com/allenai/rainbo
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
We present ATOMIC, an atlas of everyday commonsense reasoning, organized
through 877k textual descriptions of inferential knowledge. Compared to
existing resources that center around taxonomic knowledge, ATOMIC focuses on
inferential knowledge organized as typed if-then relations with variables
(e.g., "if X pays Y a compliment, then Y will likely return the compliment").
We propose nine if-then relation types to distinguish causes vs. effects,
agents vs. themes, voluntary vs. involuntary events, and actions vs. mental
states. By generatively training on the rich inferential knowledge described in
ATOMIC, we show that neural models can acquire simple commonsense capabilities
and reason about previously unseen events. Experimental results demonstrate
that multitask models that incorporate the hierarchical structure of if-then
relation types lead to more accurate inference compared to models trained in
isolation, as measured by both automatic and human evaluation.
Comment: AAAI 2019 C
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
Leaderboards have eased model development for many NLP datasets by
standardizing their evaluation and delegating it to an independent external
repository. Their adoption, however, is so far limited to tasks that can be
reliably evaluated in an automatic manner. This work introduces GENIE, an
extensible human evaluation leaderboard, which brings the ease of leaderboards
to text generation tasks. GENIE automatically posts leaderboard submissions to
crowdsourcing platforms asking human annotators to evaluate them on various
axes (e.g., correctness, conciseness, fluency) and compares their answers to
various automatic metrics. We introduce several datasets in English to GENIE,
representing four core challenges in text generation: machine translation,
summarization, commonsense reasoning, and machine comprehension. We provide
formal granular evaluation metrics and identify areas for future research. We
make GENIE publicly available and hope that it will spur progress in language
generation models as well as their automatic and manual evaluation.
Instrumental performance and results from testing of the BLAST-TNG receiver, submillimeter optics, and MKID arrays
Polarized thermal emission from interstellar dust grains can be used to map
magnetic fields in star forming molecular clouds and the diffuse interstellar
medium (ISM). The Balloon-borne Large Aperture Submillimeter Telescope for
Polarimetry (BLASTPol) flew from Antarctica in 2010 and 2012 and produced
degree-scale polarization maps of several nearby molecular clouds with
arcminute resolution. The success of BLASTPol has motivated a next-generation
instrument, BLAST-TNG, which will use more than 3000 linear polarization
sensitive microwave kinetic inductance detectors (MKIDs) combined with a 2.5 m
diameter carbon fiber primary mirror to make diffraction-limited observations
at 250, 350, and 500 μm. With 16 times the mapping speed of BLASTPol,
sub-arcminute resolution, and a longer flight time, BLAST-TNG will be able to
examine nearby molecular clouds and the diffuse galactic dust polarization
spectrum in unprecedented detail. The 250 μm detector array has been
integrated into the new cryogenic receiver, and is undergoing testing to
establish the optical and polarization characteristics of the instrument.
BLAST-TNG will demonstrate the effectiveness of kilo-pixel MKID arrays for
applications in submillimeter astronomy. BLAST-TNG is scheduled to fly from
Antarctica in December 2017 for 28 days and will be the first balloon-borne
telescope to offer a quarter of the flight for "shared risk" observing by the
community.
Comment: Presented at SPIE Millimeter, Submillimeter, and Far-Infrared
Detectors and Instrumentation for Astronomy VIII, June 29th, 201
Characterization, deployment, and in-flight performance of the BLAST-TNG cryogenic receiver
The Next Generation Balloon-borne Large Aperture Submillimeter Telescope
(BLAST-TNG) is a submillimeter polarimeter designed to map interstellar dust
and galactic foregrounds at 250, 350, and 500 microns during a 24-day Antarctic
flight. The BLAST-TNG detector arrays comprise 918, 469, and 272 MKID
pixels, respectively. Each pixel is formed from two orthogonally oriented,
crossed, linear-polarization-sensitive MKID antennae. The arrays are cooled to
sub-300 mK temperatures and stabilized via a closed-cycle ³He sorption fridge
in combination with a ⁴He vacuum pot. The detectors are read out through a
combination of the second-generation Reconfigurable Open Architecture Computing
Hardware (ROACH2) and custom RF electronics designed for BLAST-TNG. The
firmware and software designed to read out and characterize these detectors
were built from scratch by the BLAST team, and have been
adapted for use by other MKID instruments such as TolTEC and OLIMPO. We present
an overview of these systems as well as in-depth methodology of the
ground-based characterization and the measured in-flight performance.
Comment: Presented at SPIE Millimeter, Submillimeter, and Far-Infrared
Detectors and Instrumentation for Astronomy X, December 13-18, 202
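The figures quoted in the two BLAST-TNG abstracts above can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from either paper: it sums the per-band pixel counts, doubles them for the two MKIDs per pixel, and estimates the diffraction-limited beam of a 2.5 m aperture in each band. The 1.22 λ/D Rayleigh criterion is an assumption (the abstracts do not say how resolution is defined), and all function and constant names are hypothetical.

```python
import math

# Pixel counts per band (wavelength in microns), as quoted in the receiver abstract.
PIXELS_PER_BAND = {250: 918, 350: 469, 500: 272}
MKIDS_PER_PIXEL = 2       # two crossed, polarization-sensitive MKIDs per pixel
MIRROR_DIAMETER_M = 2.5   # primary mirror diameter quoted in the instrument abstract
RAYLEIGH_FACTOR = 1.22    # ASSUMPTION: Rayleigh criterion for a circular aperture


def total_detectors(pixels_per_band, mkids_per_pixel=MKIDS_PER_PIXEL):
    """Total MKID count implied by the per-band pixel counts."""
    return mkids_per_pixel * sum(pixels_per_band.values())


def diffraction_limit_arcsec(wavelength_um, diameter_m=MIRROR_DIAMETER_M):
    """Approximate diffraction-limited resolution in arcseconds."""
    theta_rad = RAYLEIGH_FACTOR * wavelength_um * 1e-6 / diameter_m
    return math.degrees(theta_rad) * 3600.0


if __name__ == "__main__":
    print("Total MKIDs:", total_detectors(PIXELS_PER_BAND))
    for band_um in sorted(PIXELS_PER_BAND):
        beam = diffraction_limit_arcsec(band_um)
        print(f"{band_um} um: ~{beam:.1f} arcsec")
```

Under these assumptions the pixel counts imply 3318 detectors, in line with the "more than 3000" quoted for the planned instrument, and the estimated beam stays below one arcminute in all three bands, in line with the "sub-arcminute resolution" claim. The measured beams reported by the BLAST team may differ from this idealized estimate.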